Expanding the Colorectal Cancer Biomarkers Based on the Human Gut Phageome

ABSTRACT With the increasing prevalence of colorectal cancer (CRC), extending the present biomarkers for the diagnosis of colorectal cancer is crucial. Previous studies have highlighted the importance of bacteriophages in gastrointestinal diseases, suggesting the potential value of the gut phageome in early CRC diagnosis. Here, based on 317 metagenomic samples of three discovery cohorts collected from China (Hong Kong), Austria, and Japan, five intestinal bacteriophages, including Fusobacterium nucleatum, Peptacetobacter hiranonis, and Parvimonas micra phages, were identified as potential CRC biomarkers. The five CRC-enriched phage markers classified patients from controls with an area under the receiver-operating characteristics curve (AUC) of 0.8616 across different populations. Subsequently, we used a total of 80 samples from China (Hainan) and Italy for validation. The AUC of the validation cohort was 0.8197. Moreover, to further explore the specificity of the five intestinal bacteriophage biomarkers in a broader background, we performed a confirmatory meta-analysis using two inflammatory bowel disease cohorts, ulcerative colitis (UC) and Crohn's disease (CD). Excitingly, we observed that the five CRC-enriched phage markers also exhibited high discrimination in UC (AUC = 78.02%). Unfortunately, the five CRC-enriched phage markers did not show high resolution in CD (AUC = 48.00%). The present research expands the potential of microbial biomarkers in CRC diagnosis by building a more accurate classification model based on the human gut phageome, providing a new perspective for CRC gut phagotherapy.
IMPORTANCE As of 2020, colorectal cancer was the third most common cancer worldwide, after lung and breast cancer. Phages are strictly host-specific, and this specificity makes them more accurate as biomarkers, but phage biomarkers for colorectal cancer have not been thoroughly explored. Therefore, it is crucial to extend the existing phage biomarkers for the diagnosis of colorectal cancer. Here, we constructed a relatively accurate prediction model using three discovery cohorts, two additional validation cohorts, and two cross-disease cohorts. A total of five possible intestinal bacteriophage biomarkers were obtained: Peptacetobacter hiranonis Phage, Fusobacterium nucleatum animalis 7_1 Phage, Fusobacterium nucleatum polymorphum Phage, Fusobacterium nucleatum animalis 4_8 Phage, and Parvimonas micra Phage. This study aims to identify fine-scale, species- and strain-level phage biomarkers for colorectal cancer, so as to expand the existing CRC biomarkers and provide a new perspective for gut phage therapy of colorectal cancer.

To submit your modified manuscript, log onto the eJP submission site at https://spectrum.msubmit.net/cgi-bin/main.plex. Go to Author Tasks and click the appropriate manuscript title to begin the revision process. The information that you entered when you first submitted the paper will be displayed. Please update the information as necessary. Here are a few examples of required updates that authors must address:
• Point-by-point responses to the issues raised by the reviewers in a file named "Response to Reviewers," NOT IN YOUR COVER LETTER.
• Upload a compare copy of the manuscript (without figures) as a "Marked-Up Manuscript" file.
• Each figure must be uploaded as a separate file, and any multipanel figures must be assembled into one file.
For complete guidelines on revision requirements, please see the Instructions to Authors at [link to page]. Submissions of a paper that does not conform to Microbiology Spectrum guidelines will delay acceptance of your manuscript.
Please return the manuscript within 60 days; if you cannot complete the modification within this time period, please contact me. If you do not wish to modify the manuscript and prefer to submit it to another journal, please notify me of your decision immediately so that the manuscript may be formally withdrawn from consideration by Microbiology Spectrum.
If you would like to submit an image for consideration as the Featured Image for an issue, please contact Spectrum staff.
If your manuscript is accepted for publication, you will be contacted separately about payment when the proofs are issued; please follow the instructions in that e-mail. Arrangements for payment must be made before your article is published. For a complete list of Publication Fees, including supplemental material costs, please visit our website.

Comments for Manuscript number: Spectrum00090-21
Title: "Expanding the colorectal cancer biomarkers based on the human gut phageome"

Reviewer #1 (Comments for the Author):
Siyuan Shen and colleagues established a high-performance diagnostic model for colorectal cancer based on three training cohorts and four validation cohorts of whole metagenome sequencing data. This is an exciting work since it reveals the unique potential of the virus in the microbiome in translational medicine. Given that the newly discovered viral biomarkers for CRC may be of interest to clinicians, a meta-analysis exploring this issue should be of interest to the growing community of researchers in the field of microbiome research. I think this paper can be improved by addressing the following major concerns:
Response: We appreciate the reviewer's insightful comments, which allowed us to improve the manuscript. Please find our point-by-point responses below.
1. Compared with previous work (such as biomarkers identified in metagenome, methylation, and metabolome for CRC), the author needs to fully discuss the unique value of using phageome to discover biomarkers.
Response: We appreciate your very insightful concern. The discovery of phage biomarkers has a unique value compared to biomarkers identified in the metagenome, methylation, and metabolome for CRC. By associating the composition of intestinal phages with the occurrence of inflammatory bowel disease, Jason M. Norman and colleagues found that the number and diversity of bacteria in the intestines of patients with inflammatory bowel disease were low, whereas the number, abundance, and diversity of some phages were increased. This indicates that the change in phages is independent of the change in bacteria [1]. At the same time, the number of bacteria in the human intestine is about 10^14, while the number of bacteriophages is about 10^15 to 10^16

Sequence data collection
Fecal metagenomic data for CRC patients and controls were collected for the meta-analysis. For the discovery cohorts, raw SRA files and sample information were downloaded from NCBI: China (Hong Kong) (ref. 1), accession PRJEB10878 (CRC, n = 74; control, n = 54); Japan (ref. 2), accession DRA006684 (CRC, n = 40; control, n = 40); and Austria (ref. 3), accession ERP008729 (CRC, n = 46; control, n = 63) (Table 1). For the validation cohorts, raw SRA files and sample information were also downloaded from NCBI: China (Hainan) (ref. 4), accession PRJNA663646 (CRC, n = 8; control, n = 12); and Italy (ref. 5), accession SRP136711 (CRC, n = 32; control, n = 28) (Table 1). The SRA files and sample information for the other disease validation cohorts were likewise downloaded from NCBI.
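As a quick consistency check, the cohort sizes listed above can be tallied against the totals quoted in the abstract (317 discovery samples and 80 validation samples). The sketch below is only a bookkeeping aid; the dictionary layout and the function name `total_samples` are illustrative choices, not part of the authors' pipeline.

```python
# Cohort sizes as reported in the text: (accession, CRC n, control n).
DISCOVERY = {
    "China (Hong Kong)": ("PRJEB10878", 74, 54),
    "Japan": ("DRA006684", 40, 40),
    "Austria": ("ERP008729", 46, 63),
}
VALIDATION = {
    "China (Hainan)": ("PRJNA663646", 8, 12),
    "Italy": ("SRP136711", 32, 28),
}


def total_samples(cohorts):
    """Sum CRC and control counts over all cohorts in the mapping."""
    return sum(crc + control for _, crc, control in cohorts.values())


discovery_total = total_samples(DISCOVERY)
validation_total = total_samples(VALIDATION)
print(discovery_total, validation_total)
```

Both totals agree with the figures given in the abstract.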

Selection of bacteriophage biomarkers and application of machine learning
Five CRC bacteriophage biomarkers were identified using the random forest (RF) model (ref. 9) and the Wilcoxon rank-sum test. We used the random forest model to search for biomarkers among 142,809 intestinal phages, and applied the R package "ranger" (v0.12.1) to implement the random forest algorithm for each classification task. All hyperparameters were left at their defaults except for the number of trees, which was set to 5000. The predictive performance of the RF model was evaluated by ten-fold cross-validation, and five bacteriophage biomarkers with a contribution rate >0.1% were identified. In parallel, we used the Wilcoxon rank-sum test to search for phages that differed significantly (p < 0.0001) between CRC patients and healthy people in each of the three discovery cohorts, and then combined the differential phages found in the three cohorts. Five phages were significantly different in all three cohorts and were enriched in the intestinal tract of CRC patients. Interestingly, the five biomarkers found by the random forest model were the same as the five found by the Wilcoxon rank-sum test. Therefore, we set these five phages as biomarkers, and their AUC reached 86.16%.
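The rank-based part of this screening can be illustrated with a minimal sketch in Python. It is not the authors' R code: it assumes a normal approximation to the rank-sum statistic without tie or continuity corrections, and the names `average_ranks`, `rank_sum_auc`, and `shared_enriched` are hypothetical. It relies on the standard identity that the Mann-Whitney U statistic divided by the number of case-control pairs equals the AUC of a single marker.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def rank_sum_auc(cases, controls):
    """Return (AUC, two-sided p) for one marker, cases vs. controls."""
    n1, n0 = len(cases), len(controls)
    ranks = average_ranks(list(cases) + list(controls))
    u = sum(ranks[:n1]) - n1 * (n1 + 1) / 2   # Mann-Whitney U for cases
    auc = u / (n1 * n0)                       # U / number of pairs
    sd_u = math.sqrt(n1 * n0 * (n1 + n0 + 1) / 12)
    z = (u - n1 * n0 / 2) / sd_u              # normal approximation
    p = math.erfc(abs(z) / math.sqrt(2))      # two-sided p-value
    return auc, p


def shared_enriched(cohorts, alpha=1e-4):
    """Markers that are significant and case-enriched in every cohort.

    cohorts: list of dicts mapping marker name -> (cases, controls).
    """
    shared = None
    for cohort in cohorts:
        hits = set()
        for name, (cases, controls) in cohort.items():
            auc, p = rank_sum_auc(cases, controls)
            if p < alpha and auc > 0.5:
                hits.add(name)
        shared = hits if shared is None else shared & hits
    return shared or set()
```

Taking the intersection across cohorts mirrors the requirement that a phage be significantly different in all three discovery cohorts; the default `alpha` follows the p < 0.0001 threshold stated above.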

Statistics statement
All statistical analyses were performed using R software. Violin plots were drawn with the "vioplot" package. PCoA analysis was performed using the "ade4" package in R. The differential abundances of various profiles were tested with the Wilcoxon rank-sum test and were considered significantly different at p < 0.05. Boxplots were drawn with the "ggplot2" package. Receiver operating characteristic (ROC) analysis was used to assess the performance of the microbial biomarkers using the "pROC" package in R. The Venn diagram was built using an online tool called "Omicstudio".
3. Biological interpretation of the viral biomarkers or their associations with previous studies are highly warranted to enrich the context and offer a comprehensive story to the readers.
Response: Thank you for your concern. Studies have shown that the number of bacteriophages is higher than that of bacteria in the intestines of patients with inflammatory bowel disease, and the changes of bacteriophages in the intestines are independent of the changes of bacteria.
Response: Thank you for your insightful comment. We've used the software Grammar to polish it. I believe the article will be improved to a certain extent.
Minor comments for abstract and introduction: Title: "the human gut phageome" is more in line with grammatical habits.
Response: Thanks for your concern, we have modified this point per your suggestion.
Line 21: "With the increasing proportion of colorectal cancer ..." should be "With the increasing prevalence of colorectal cancer ...".
Response: Thanks for your concern, we have modified this point per your suggestion.
Line 25: This sentence may go as "based on 317 metagenomic samples of three discovery cohorts collected from China (Hong Kong), Austria, and Japan ...".
Response: Thanks for your concern, we have modified this point per your suggestion.
Line 30: It should be "across different populations" without "the".
Response: Thanks for your concern, we have modified this point per your suggestion.
Line 36: "Excitingly, we observed that the five CRC-enriched phage markers also exhibited high discrimination in UC (AUC = 80.03%) and CD (AUC = 71.76%)" is confusing, the biomarker can discriminate CRC vs. other types of samples in UC and CD cohorts instead of classifying UC and CD.
Response: Thanks for your concern. What we want to express is that the five phage biomarkers we found are highly accurate and specific for distinguishing between CRC patients, not only healthy people and CRC patients, but also CRC patients and IBD patients (we believe that CRC and IBD have a certain degree of similarity).
Line 37: Can be easier to understand as: "The present research expands the potential of microbial biomarkers in CRC diagnosis by building a more accurate classification model based on the human gut phageome, providing a new perspective for CRC gut phagotherapy."
Response: Thanks for your concern, we have modified this point per your suggestion.
Line 42: "Phages are strictly host-specific, but phage biomarkers for ..." the logic here is not clear.
Response: Thank you for your comment. The logic of this sentence was indeed not clear enough. We have changed it to: "Phages are strictly host-specific, and this specificity makes them more accurate as biomarkers, but phage biomarkers for colorectal cancer have not been thoroughly explored."
Line 60: The argument is too strong and doesn't fit the fact, please revise.
Response: Thanks for your concern, we have modified this point per your suggestion.
Line 61: Should be revised as "The individuals of bacteriophage are more than bacteria in microbiota ..."
Response: Thanks for your concern, we have modified this point per your suggestion.

Reviewer #2 (Comments for the Author):
The study took the public datasets and re-analyzed them for bacteriophage and found distinctive phage signatures between control and CRC fecal samples, which could also separate IBD samples with decent accuracy. The study is certainly useful and very interesting. However, I also have some major concerns:
Response: We appreciate the reviewer's insightful comments, which allowed us to improve the manuscript. Please find our point-by-point responses below.
1. the results are thin. The whole manuscript has only a single figure. There are much more analyses and content to dig. Some are listed in the below comments.
Response: Thanks for your concern. We acknowledge that our article is short and contains only one figure. This is because the submission section is Observations, which requires no more than 1200 words, no more than 2 figures/tables (our paper includes one figure and one table), and no more than 25 references. We will respond to the comments below point by point.

Data quality control and phage database acquisition
Step

Response: Thanks for your query. The classification accuracy of the reported bacterial biomarkers can reach 80%, while the classification accuracy of the phage biomarkers we found can reach 86.16%. Bacteriophages therefore have value as biomarkers in their own right. Studies have shown that the number of bacteriophages is higher than that of bacteria in the intestines of patients with inflammatory bowel disease, and the changes of bacteriophages in the intestines are independent of the changes of bacteria. In addition, some studies show that the virome is a candidate for contributing to, or being a biomarker for, human inflammatory bowel disease and speculate that the enteric virome may play a role in other diseases [1]. Therefore, we believe that phages are valuable as biomarkers.
3. What about the other phages? Little is mentioned or discussed.
Response: Thanks for your concern. Our five CRC phage biomarkers were found simultaneously using the random forest model and the Wilcoxon rank-sum test. We used the random forest model to obtain 100 possible biomarkers, but selected only the 5 biomarkers with a contribution rate >0.1% as our research objects, so we rarely mentioned or discussed the other possible phage biomarkers. The contributions of the 100 possible biomarkers will be provided in supplementary table 1.
4. Are these phages in latent form (prophage) or lytic phage? Can this be bioinformatically analyzed?
Response: Thanks for your query. Based on the data available so far, it is not clear whether these phages are prophages or lytic phages. We compared the names and related information of the phages in the database, and the database did not specify what kind of phages these are. To determine this, we would need high-quality reads and bins from next-next-generation sequencing, but our data were obtained using next-generation sequencing. So, unfortunately, we do not know exactly what these phages are.
5. The CRC vs. UC/CD can be batch effects or other sequencing artifacts, because CRC and UC/CD are from different cohorts/studies. As we know, different sample processing can create big difference in data.
Response: Thanks for your concern. We acknowledge that different sample processing will indeed produce large data differences. However, the results of our meta-analysis of CRC vs. UC/CD are relatively reliable for the following four reasons:
(1) The CRC, UC, and CD sample data we used were all generated by whole-genome shotgun sequencing on the Illumina HiSeq 2000/2500 (Illumina, San Diego, USA) platform.
(2) We carried out strict standardized quality control for all data used in the meta-analysis.
(3) We analyzed and processed the CRC, UC, and CD data in the same way.
(4) We conducted cluster analysis on the CRC, UC, and CD cohorts and found no significant difference among the three cohorts.
1st Revision - Editorial Decision
We carefully reviewed your responses to the comments and think a few of them are not sufficiently satisfactory. In particular:
1) The comparison of disease classification using phage vs. bacteria. You cited the accuracies from the literature, which is not a fair comparison. For completeness of the study, we encourage you to add a figure to Figure 1 to show the prediction of the bacterial profile on these exact same data sets.
2) The batch effect still exists even with the same sequencing platform and bioinformatic pipeline. Your PCoA plot actually shows that the 3 datasets were grouped. You could try to remove the batch effects or avoid the claims that require combining all the datasets together.
With these two points further addressed, your manuscript is readily accepted for MS.
Thank you for submitting your manuscript to Microbiology Spectrum. When submitting the revised version of your paper, please provide (1) point-by-point responses to the issues raised by the reviewers as file type "Response to Reviewers," not in your cover letter, and (2) a PDF file that indicates the changes from the original submission (by highlighting or underlining the changes) as file type "Marked Up Manuscript - For Review Only". Please use this link to submit your revised manuscript. We strongly recommend that you submit your paper within the next 60 days or reach out to me. Detailed information on submitting your revised paper is below.

Thank you for the privilege of reviewing your work. Below you will find instructions from the Microbiology Spectrum editorial office and comments generated during the review.
The ASM Journals program strives for constant improvement in our submission and publication process. Please tell us how we can improve your experience by taking this quick Author Survey. For complete guidelines on revision requirements, please see the journal Submission and Review Process requirements at https://journals.asm.org/journal/Spectrum/submission-review-process. Submissions of a paper that does not conform to Microbiology Spectrum guidelines will delay acceptance of your manuscript.
1. The comparison of disease classification using phage vs. bacteria. You cited the accuracies from literature, which is not a fair comparison. For completeness of the study, we encourage you to add a figure to Figure 1 to show the prediction of bacteria profile on these exact same data sets.
Response: We appreciate your very insightful concern. We decided to adopt your suggestion and add the prediction of the bacterial profile on the same data sets to the article and Figure 1. We believe that your suggestion has made our article both more complete and more thorough.
2. The batch effect still exists even with same sequencing platform and bioinformatic pipeline. Your PCoA plot actually shows that the 3 datasets were grouped. You could try to remove the batch effects or avoid the claims that require combining all the datasets together.
Response: Thanks for your concern. We agree with your suggestion. To eliminate the batch effect completely, these samples would have to be resequenced on the same platform and in the same batch, which was clearly not possible in this study. Therefore, we prefer to adopt your second suggestion, "avoid the claims that require combining all the datasets together," and we have dropped the cluster analysis that combined the three datasets. When analyzing the specificity of the phage biomarkers, we carried out strict quality control on all data, annotated the biomarkers with the random forest method, and then calculated the specificity of these biomarkers using the area under the receiver-operating characteristic curve.
1. The comparison of disease classification using phage vs. bacteria. You cited the accuracies from literature, which is not a fair comparison. For completeness of the study, we encourage you to add a figure to Figure 1 to show the prediction of bacteria profile on these exact same data sets. (Please describe how the bacterial profiles were computed.)
Response: We appreciate your very insightful concern. We decided to adopt your suggestion and add the prediction of the bacterial profile on the same data sets to the article and Figure 1. The method for calculating bacterial abundance and the method for screening bacterial biomarkers are described in Materials and Methods. We believe that your suggestion has made our article both more complete and more thorough.
2. The batch effect still exists even with same sequencing platform and bioinformatic pipeline. Your PCoA plot actually shows that the 3 datasets were grouped. You could try to remove the batch effects or avoid the claims that require combining all the datasets together. (Correct batch effects for Fig 1G/H/I.)
Response: We are very grateful to you for raising the issue of batch effect, which is very helpful to us. Batch effects are unwanted data variations that may obscure biological signals, leading to bias or errors in subsequent data analyses. Effective evaluation and elimination of batch effects are necessary for omics data analysis. Therefore, the PVCA method in the online tool "BatchServer" was used to evaluate the batch effect of the sample data [1], and we found that the data did have a high batch effect, as shown in the figure below. We therefore used ComBat in the sva package, a widely used method for eliminating batch effects, to remove batch effects from the sample data. After batch-effect removal, the results differed slightly from those obtained without it. The methods and results of removing the batch effect are presented in the paper. After removing the batch effect, the validation cohort still had high accuracy, with an AUC of 81.97%. However, the results for the specificity of the five CRC phage biomarkers changed significantly after the batch effect was removed. The specificity of the 5 phage biomarkers for CRC versus UC was high, with an AUC of 78.02%, whereas the 5 phage biomarkers had no specificity for CRC versus CD, with an AUC of only 48.00%. Although the 5 CRC phage biomarkers are not specific with respect to CD, this is also interesting to some extent, and we decided to retain this conclusion. Finally, thank you again for your suggestions on batch effect, which improved our article and made the results more accurate.
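To make the idea of batch adjustment concrete, the following Python sketch shifts each batch so that its mean for one feature matches the overall mean. This is deliberately a much simpler stand-in and is not ComBat: ComBat also adjusts scale, shrinks the per-batch estimates with an empirical Bayes step, and can protect covariates such as disease status. The function name `center_batches` and the toy values are hypothetical.

```python
from statistics import mean


def center_batches(values, batches):
    """Location-only batch adjustment for a single feature.

    values:  abundance of one feature per sample
    batches: batch (cohort) label per sample, same length as values
    Each batch is shifted so that its mean equals the grand mean.
    """
    grand_mean = mean(values)
    batch_means = {
        b: mean([v for v, label in zip(values, batches) if label == b])
        for b in set(batches)
    }
    return [v - batch_means[b] + grand_mean for v, b in zip(values, batches)]


# Toy example: two cohorts measured on visibly different scales.
adjusted = center_batches(
    [1.0, 2.0, 3.0, 11.0, 12.0, 13.0],
    ["a", "a", "a", "b", "b", "b"],
)
print(adjusted)
```

One caveat of any such centering is that, if the proportion of cases and controls differs between batches, part of the true biological signal is removed together with the batch shift, which is one reason covariate-aware methods are usually preferred.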